4 research outputs found

    AsterixDB: A Scalable, Open Source BDMS

    AsterixDB is a new, full-function BDMS (Big Data Management System) with a feature set that distinguishes it from other platforms in today's open source Big Data ecosystem. Its features make it well suited to applications such as web data warehousing, social data storage and analysis, and other Big Data use cases. AsterixDB has a flexible NoSQL-style data model; a query language that supports a wide range of queries; a scalable runtime; partitioned, LSM-based data storage and indexing (including B+-tree, R-tree, and text indexes); support for external as well as natively stored data; a rich set of built-in types; support for fuzzy, spatial, and temporal types and queries; a built-in notion of data feeds for ingestion of data; and transaction support akin to that of a NoSQL store. Development of AsterixDB began in 2009 and led to an initial open source release in mid-2013. This paper is the first complete description of the resulting open source AsterixDB system. Covered herein are the system's data model, its query language, and its software architecture. Also included are a summary of the current status of the project and a first glimpse into how AsterixDB performs, compared to alternative technologies (a parallel relational DBMS, a popular NoSQL store, and a popular Hadoop-based SQL data analytics platform), on tasks that both AsterixDB and the alternative can handle. Also included is a brief description of some initial trials that the system has undergone and the lessons learned (and plans laid) based on those early "customer" engagements.
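    The abstract above highlights AsterixDB's declarative query interface and flexible data model. As a rough illustration only, the sketch below submits schema statements and a query to a running AsterixDB instance from Python over its HTTP query service; the endpoint URL and port, the SocialData dataverse, the TweetType datatype, and the Tweets dataset are all illustrative assumptions, and the statements use SQL++ (the language adopted by later AsterixDB releases) rather than the AQL described in the paper.

        # Minimal sketch, assuming a local AsterixDB instance whose HTTP query
        # service is reachable at the URL below; all names are illustrative.
        import requests

        ASTERIXDB_QUERY_URL = "http://localhost:19002/query/service"  # assumed endpoint

        def run_statement(statement: str) -> dict:
            """POST one statement to the assumed AsterixDB HTTP query service."""
            response = requests.post(ASTERIXDB_QUERY_URL, data={"statement": statement})
            response.raise_for_status()
            return response.json()

        # Hypothetical open (NoSQL-style) datatype, dataset, and query.
        for stmt in [
            "CREATE DATAVERSE SocialData IF NOT EXISTS;",
            "CREATE TYPE SocialData.TweetType AS { id: string, user: string, posted_at: datetime };",
            "CREATE DATASET SocialData.Tweets(SocialData.TweetType) PRIMARY KEY id;",
        ]:
            run_statement(stmt)

        result = run_statement("SELECT t.user, t.posted_at FROM SocialData.Tweets t LIMIT 5;")
        print(result.get("results"))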

    Progressive Approach To Entity Resolution

    No full text
    Data-driven technologies such as decision support, analysis, and scientific discovery tools have become a critical component of many organizations and businesses. The effectiveness of such technologies, however, is closely tied to the quality of the data on which they are applied. That is why organizations today spend a substantial percentage of their budgets on cleaning tasks such as removing duplicates, correcting errors, and filling in missing values to improve data quality before pushing data through the analysis pipeline. Entity resolution (ER), the process of identifying which entities in a dataset refer to the same real-world object, is a well-known data cleaning challenge. This process, however, is traditionally performed as an offline step before the data is made available for analysis. Such an offline strategy is simply unsuitable for many emerging analytical applications that require low-latency responses (and thus cannot tolerate delays caused by cleaning the entire dataset), as well as for situations where the underlying resources are constrained or costly to use. To overcome these limitations, we study in this thesis a new paradigm for ER: progressive entity resolution. Progressive ER aims to resolve the dataset in a way that maximizes the rate at which data quality improves. This approach can substantially reduce the resolution cost, since the ER process can be terminated early once a satisfactory level of quality is achieved.

    In this thesis, we explore two aspects of the ER problem and propose a progressive approach to each of them. In particular, we first propose a progressive approach to relational ER, wherein the input dataset consists of multiple entity-sets and relationships among them. We then propose a parallel approach to entity resolution using the popular MapReduce (MR) framework. A comprehensive empirical evaluation of the two proposed approaches demonstrates that they achieve high-quality results while incurring limited resolution cost.
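    To make the progressive idea above concrete, the following is a minimal generic sketch, not the thesis's specific relational or MapReduce-based algorithms: candidate pairs are ranked by an estimated likelihood of matching, compared in that order, and the loop stops as soon as a fixed comparison budget is spent, so the earliest work yields the largest quality gain. The similarity function, threshold, and sample records are illustrative assumptions.

        # Minimal sketch of progressive entity resolution under a comparison budget.
        # The similarity measure, threshold, and sample records are illustrative.
        from difflib import SequenceMatcher
        from itertools import combinations

        def similarity(a: str, b: str) -> float:
            """Cheap string similarity, used here both to rank and to decide matches."""
            return SequenceMatcher(None, a.lower(), b.lower()).ratio()

        def progressive_resolve(records, budget, threshold=0.85):
            """Compare at most `budget` candidate pairs, most promising first."""
            # Rank candidate pairs by estimated match likelihood; a real system
            # would use a much cheaper, blocking-based estimate for this step.
            candidates = sorted(
                combinations(range(len(records)), 2),
                key=lambda p: similarity(records[p[0]], records[p[1]]),
                reverse=True,
            )
            matches = []
            for i, j in candidates[:budget]:  # terminate once the budget is exhausted
                if similarity(records[i], records[j]) >= threshold:
                    matches.append((records[i], records[j]))
            return matches

        people = ["Jon Smith", "John Smith", "J. Smith", "Alice Jones", "Alicia Jones"]
        print(progressive_resolve(people, budget=3))  # recovers the clearest duplicates first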